Adversarial Training in Affective Computing and Sentiment Analysis: Recent Advances and Perspectives
Over the past few years, adversarial training has become an extremely active
research topic and has been successfully applied to various Artificial
Intelligence (AI) domains. Because it is a potentially crucial technique for the
development of the next generation of emotional AI systems, we herein provide a
comprehensive overview of the application of adversarial training to affective
computing and sentiment analysis. Various representative adversarial training
algorithms are explained and discussed, each aimed at tackling the diverse
challenges associated with emotional AI systems. Further, we highlight a range
of potential future research directions. We expect that this overview will help
facilitate the development of adversarial training for affective computing and
sentiment analysis in both the academic and industrial communities.
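To make the kind of algorithm discussed above concrete, below is a minimal sketch of one representative family: training on gradient-based (FGSM-style) perturbations of input embeddings for a toy sentiment classifier. The model, data, and the epsilon value are illustrative assumptions, not the setup of any specific work covered in the overview.

```python
# Minimal sketch of FGSM-style adversarial training on input embeddings
# for a toy sentiment classifier. All sizes and the epsilon value are
# illustrative assumptions.
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=128, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, classes)

    def forward(self, emb):                 # emb: (batch, time, emb_dim)
        _, h = self.rnn(emb)
        return self.out(h.squeeze(0))       # logits: (batch, classes)

model = SentimentClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
epsilon = 0.1                               # assumed perturbation magnitude

tokens = torch.randint(0, 1000, (8, 20))    # toy batch of token ids
labels = torch.randint(0, 2, (8,))          # toy sentiment labels

# 1) Clean pass; keep the gradient w.r.t. the embeddings.
emb = model.embed(tokens)
emb.retain_grad()
clean_loss = criterion(model(emb), labels)
clean_loss.backward(retain_graph=True)

# 2) Craft an FGSM perturbation in embedding space, detached from the graph.
delta = epsilon * emb.grad.detach().sign()
adv_loss = criterion(model(emb.detach() + delta), labels)

# 3) Update the model on the combined clean + adversarial objective.
optimizer.zero_grad()
(clean_loss + adv_loss).backward()
optimizer.step()
```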
Learning Audio Sequence Representations for Acoustic Event Classification
Acoustic Event Classification (AEC) has become a significant task for
machines to perceive the surrounding auditory scene. However, extracting
effective representations that capture the underlying characteristics of the
acoustic events is still challenging. Previous methods mainly focused on
designing the audio features in a 'hand-crafted' manner. Interestingly,
data-learnt features have recently been reported to show better performance. Up
to now, however, these have only been considered at the frame level. In this paper, we
propose an unsupervised learning framework to learn a vector representation of
an audio sequence for AEC. This framework consists of a Recurrent Neural
Network (RNN) encoder and an RNN decoder, which respectively transform the
variable-length audio sequence into a fixed-length vector and reconstruct the
input sequence from the generated vector. After training the encoder-decoder, we
feed the audio sequences to the encoder and take the learnt vectors as the
audio sequence representations. Compared with previous methods, the proposed
method can not only deal with arbitrary-length audio
streams, but also learn the salient information of the sequence. Extensive
evaluation on a large-scale acoustic event database shows that the learnt
audio sequence representations outperform other state-of-the-art hand-crafted
sequence features for AEC by a large margin.
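As a hedged illustration of the encoder-decoder idea described above (not the exact configuration used in the paper), the sketch below uses a GRU encoder whose final hidden state serves as the fixed-length sequence vector and a GRU decoder that reconstructs the frames from it. The feature dimensionality and training details are assumptions.

```python
# Sketch of a sequence autoencoder: a GRU encoder compresses a
# variable-length feature sequence into a fixed-length vector, and a GRU
# decoder reconstructs the sequence from that vector.
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, feat_dim=40, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.project = nn.Linear(hidden, feat_dim)

    def encode(self, x):
        # x: (batch, time, feat_dim); the final hidden state is the
        # fixed-length representation of the whole sequence.
        _, h = self.encoder(x)
        return h.squeeze(0)                      # (batch, hidden)

    def forward(self, x):
        code = self.encode(x)
        # Condition the decoder on the code and reconstruct the input
        # frame by frame (teacher forcing on the shifted input).
        dec_in = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        out, _ = self.decoder(dec_in, code.unsqueeze(0).contiguous())
        return self.project(out)                 # (batch, time, feat_dim)

model = SeqAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(4, 120, 40)                      # toy batch of feature frames
loss = nn.functional.mse_loss(model(x), x)       # reconstruction objective
loss.backward()
opt.step()
# After training, model.encode(x) yields the fixed-length representation
# that can be fed to a downstream acoustic event classifier.
```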
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
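As one concrete example of the front-end techniques in this family (a common approach in the literature, not a method from any specific paper surveyed here), a small recurrent network can estimate a time-frequency mask that suppresses additive noise before recognition. The architecture and dimensions below are assumptions.

```python
# Sketch of a mask-based enhancement front-end: a recurrent network
# estimates a per-bin mask in [0, 1] that is applied to the noisy
# magnitude spectrogram before features are passed to the recognizer.
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_bins, hidden, batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_bins), nn.Sigmoid())

    def forward(self, noisy_mag):
        # noisy_mag: (batch, frames, n_bins) magnitude spectrogram
        h, _ = self.rnn(noisy_mag)
        return self.mask(h)                 # mask per time-frequency bin

model = MaskEstimator()
noisy = torch.rand(2, 100, 257)             # toy noisy magnitudes
clean = torch.rand(2, 100, 257)             # toy clean targets
enhanced = model(noisy) * noisy             # apply the estimated mask
loss = nn.functional.mse_loss(enhanced, clean)
loss.backward()
# In a joint training setup, the ASR loss would instead be back-propagated
# through both the recognizer and this front-end.
```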
Emergent Communication in Interactive Sketch Question Answering
Vision-based emergent communication (EC) aims to learn to communicate through
sketches and demystify the evolution of human communication. However,
previous works neglect multi-round interaction, which is indispensable in human
communication. To fill this gap, we first introduce a novel Interactive Sketch
Question Answering (ISQA) task, in which two collaborative players interact
through sketches to answer a question about an image in a multi-round manner.
To accomplish this task, we design a new and efficient interactive EC system,
which can achieve an effective balance among three evaluation factors:
question answering accuracy, drawing complexity, and human
interpretability. Our experimental results, including human evaluation,
demonstrate that the multi-round interactive mechanism facilitates targeted and
efficient communication between intelligent agents with decent human
interpretability. Comment: Accepted by NeurIPS 202
Implicit fusion by joint audiovisual training for emotion recognition in mono modality
A paper in ICASSP 201
Exploring perception uncertainty for emotion recognition in dyadic conversation and music listening
An article in Cognitive Computation
Latency-Aware Collaborative Perception
Collaborative perception has recently shown great potential to improve
perception capabilities over single-agent perception. Existing collaborative
perception methods usually consider an ideal communication environment.
However, in practice, the communication system inevitably suffers from latency
issues, causing potential performance degradation and high risks in
safety-critical applications, such as autonomous driving. To mitigate the
effect caused by the inevitable latency, from a machine learning perspective,
we present the first latency-aware collaborative perception system, which
actively adapts asynchronous perceptual features from multiple agents to the
same time stamp, promoting the robustness and effectiveness of collaboration.
To achieve such a feature-level synchronization, we propose a novel latency
compensation module, called SyncNet, which leverages feature-attention
symbiotic estimation and time modulation techniques. Experimental results show
that the proposed latency-aware collaborative perception system with SyncNet
outperforms the state-of-the-art collaborative perception method by 15.6%
in the communication latency scenario and keeps collaborative perception
superior to single-agent perception under severe latency. Comment: 14 pages, 11 figures, Accepted by European Conference on Computer
Vision, 202
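As a minimal sketch of the general idea of feature-level latency compensation (only an illustration of the concept, not the SyncNet module or its feature-attention and time-modulation design), a small network can predict a correction to the stale collaborator features conditioned on the measured delay, so that they better approximate the current time stamp.

```python
# Sketch of latency compensation at the feature level: predict a residual
# update for delayed features, conditioned on the delay itself.
import torch
import torch.nn as nn

class LatencyCompensator(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, stale_feat, delay):
        # stale_feat: (batch, feat_dim) features received from another agent
        # delay:      (batch, 1) latency between capture and fusion
        residual = self.net(torch.cat([stale_feat, delay], dim=-1))
        return stale_feat + residual        # estimate of the features "now"

comp = LatencyCompensator()
stale = torch.randn(4, 128)
delay = torch.rand(4, 1)
synced = comp(stale, delay)                 # fuse `synced` with ego features
```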
Generating and protecting against adversarial attacks for deep speech-based emotion recognition models
A paper in ICASSP 202